Add fallback pruning for stalled finalization #179
Conversation
When finalization stalls (e.g., network issues, low participation), the States and Block tables grow unbounded because retention-based pruning only triggers on finalization advancement. This adds a periodic check every 7200 slots (~8 hours) that calls prune_old_states and prune_old_blocks directly when finalization is far behind, adapted from Zeam's approach. The check is cheap (two integer comparisons per slot-0 tick) and the actual pruning reuses existing idempotent methods.
🤖 Kimi Code Review

Review Summary

The PR introduces a fallback pruning mechanism for when finalization is stalled, which is a good defensive measure against unbounded storage growth. However, there are several issues that need attention:

Critical Issues

Security & Correctness Issues

Performance & Best Practices
Suggested Fixes

```rust
// In blockchain/src/store.rs, fix the race condition:
let current_finalized_slot = store.latest_finalized().slot;
if slot.is_multiple_of(PRUNING_FALLBACK_INTERVAL_SLOTS)
    && slot.saturating_sub(current_finalized_slot) > PRUNING_FALLBACK_INTERVAL_SLOTS
{
    // ... rest of code
}
```
```rust
// In storage/src/store.rs, fix atomic reads:
pub fn periodic_prune(&mut self) {
    let (finalized_checkpoint, justified_checkpoint) = {
        let finalized = self.latest_finalized();
        let justified = self.latest_justified();
        (finalized, justified)
    };
    let protected_roots = [finalized_checkpoint.root, justified_checkpoint.root];
    // ... rest of code
}
```

Automated review by Kimi (Moonshot AI) · custom prompt
🤖 Codex Code Review

Could not run tests in this environment due to rustup permission errors.

Automated review by OpenAI Codex · custom prompt
Greptile Summary

This PR addresses a gap in PR #170's retention-based pruning: because pruning only triggers when finalization advances, the tables grow unbounded when finalization stalls.

Key points:
Confidence Score: 4/5
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["on_tick — interval 0"] --> B{slot > 0?}
    B -- No --> G[Skip pruning]
    B -- Yes --> C{slot % PRUNING_FALLBACK_INTERVAL_SLOTS == 0?}
    C -- No --> G
    C -- Yes --> D{"slot - finalized_slot\n> PRUNING_FALLBACK_INTERVAL_SLOTS?"}
    D -- No\nFinalization healthy --> G
    D -- Yes\nFinalization stalled --> E["warn! log stall event"]
    E --> F["store.periodic_prune()"]
    F --> F1["protected = [finalized.root, justified.root]"]
    F1 --> F2["prune_old_states(protected)\nprune_old_blocks(protected)"]
    F2 --> F3{anything pruned?}
    F3 -- Yes --> F4["info! log pruned counts"]
    F3 -- No --> G
```

Last reviewed commit: d6c7958
🤖 Claude Code Review

Here is my review of PR #179:

Review: Add periodic fallback pruning for stalled finalization

The motivation is clear and the implementation is straightforward. A few points to consider:

Notable:
…s 3 redundant DB reads for the same checkpoint value that doesn't change between calls in the interval-0 handler.
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Two tests that verify the 3-part guard in on_tick interval 0:
- on_tick_triggers_periodic_pruning_when_finalization_stalled: pruning
fires when slot is a multiple of the interval AND finalization lag
exceeds the threshold
- on_tick_skips_periodic_pruning_when_finalization_healthy: pruning does
NOT fire when finalization is recent (lag within threshold)
Also exports Table, StorageWriteBatch, and StorageReadView from the
storage crate to enable direct backend access in blockchain tests.
in crates/storage/src/store.rs (constants and test sections coexist, periodic pruning tests updated to use test_store_with_backend helper)
periodic_prune now takes a slot parameter and handles the 3-part trigger condition (slot > 0, multiple of interval, finalization lag) internally. The blockchain on_tick handler just calls store.periodic_prune(slot).
Check cheap arithmetic conditions (slot == 0, is_multiple_of) before reading latest_finalized from the database. This avoids a wasted DB read on 7199 out of 7200 calls.
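A minimal sketch of that guard ordering, assuming a hypothetical `should_prune` helper (using `%` in place of `is_multiple_of`, and a closure standing in for the store's `latest_finalized` DB read):

```rust
const PRUNING_FALLBACK_INTERVAL_SLOTS: u64 = 7200;

// Hypothetical predicate: `read_latest_finalized_slot` stands in for the
// store's database read of the latest finalized checkpoint slot.
fn should_prune(slot: u64, read_latest_finalized_slot: impl Fn() -> u64) -> bool {
    // Cheap integer checks first: 7199 out of 7200 calls return here
    // without touching the database.
    if slot == 0 || slot % PRUNING_FALLBACK_INTERVAL_SLOTS != 0 {
        return false;
    }
    // Only now pay for the DB read.
    let finalized_slot = read_latest_finalized_slot();
    slot.saturating_sub(finalized_slot) > PRUNING_FALLBACK_INTERVAL_SLOTS
}

fn main() {
    // Interval boundary with a large finalization lag: pruning fires.
    assert!(should_prune(14_400, || 0));
    // Healthy finalization (lag within threshold): pruning skipped.
    assert!(!should_prune(14_400, || 10_000));
    // Off-interval slot: skipped without ever reading the DB.
    assert!(!should_prune(7_201, || 0));
}
```

The ordering matters only because short-circuit evaluation stops at the first failed arithmetic check; the DB read is reached exclusively on interval boundaries.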
Fallback pruning now lives entirely in the storage module's update_checkpoints method as an else branch: when finalization doesn't advance, it checks the head slot and prunes if finalization is stalled. This keeps the consensus layer completely unaware of pruning logic. Removes periodic_prune as a standalone method, the on_tick call, and the blockchain-level pruning tests (now covered by storage tests).
…7200 slots

Remove the interval-based guard and the PRUNING_FALLBACK_INTERVAL_SLOTS constant. The else branch in update_checkpoints now unconditionally calls prune_old_states and prune_old_blocks when finalization doesn't advance. The prune methods are already no-ops when within retention limits, so this is safe and more testable.
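A hedged sketch of this final shape: the fallback lives entirely inside the storage layer's `update_checkpoints`. The in-memory `Vec` table, the `head_slot`-based cutoff, and all names here are simplifications, not the crate's actual API.

```rust
const STATES_TO_KEEP: u64 = 900;

// Toy stand-in for the real Store; `state_slots` plays the States table.
struct Store {
    finalized_slot: u64,
    state_slots: Vec<u64>,
}

impl Store {
    // Already a no-op when every stored state is within retention,
    // so calling it on every head update is cheap in the common case.
    fn prune_old_states(&mut self, head_slot: u64) {
        let cutoff = head_slot.saturating_sub(STATES_TO_KEEP);
        self.state_slots.retain(|&s| s >= cutoff);
    }

    fn update_checkpoints(&mut self, head_slot: u64, new_finalized_slot: u64) {
        if new_finalized_slot > self.finalized_slot {
            self.finalized_slot = new_finalized_slot;
            self.prune_old_states(head_slot); // finalization-gated path (PR #170)
        } else {
            // Fallback: finalization did not advance. Pruning unconditionally
            // is safe because prune_old_states short-circuits when nothing
            // is older than the retention window.
            self.prune_old_states(head_slot);
        }
    }
}

fn main() {
    let mut store = Store { finalized_slot: 0, state_slots: (0..=2000).collect() };
    // Finalization stalled at slot 0 while the head reached slot 2000:
    // the else branch still trims states older than head - STATES_TO_KEEP.
    store.update_checkpoints(2000, 0);
    assert_eq!(store.state_slots.first().copied(), Some(1100));
}
```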
Motivation

PR #170 adds retention-based pruning for blocks and states, but it only triggers when finalization advances (`finalized.slot > old_finalized_slot` in `update_checkpoints`). If finalization stalls (due to network issues, low participation, or other consensus failures), the `States` and `BlockHeaders`/`BlockBodies`/`BlockSignatures` tables grow unbounded because the pruning condition is never met.
Description

Adds an `else` branch to `update_checkpoints` in the storage module that calls `prune_old_states` and `prune_old_blocks` on every head update when finalization doesn't advance. The prune methods are already no-ops when within retention limits (`STATES_TO_KEEP` = 900, `BLOCKS_TO_KEEP` = 1800), so this adds negligible overhead in the common case.

Adapted from Zeam's approach of running fallback pruning when finalization is behind.
Key design decisions

- The consensus layer (`on_tick`, `on_block`) is unaware of pruning logic
- The prune methods short-circuit when there's nothing to prune
- Protected roots are `[latest_finalized.root, latest_justified.root]`, matching the existing finalization-gated path
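The protected-roots rule above can be sketched as follows. This is an illustrative toy, not the crate's actual API: a `HashMap` of root → slot stands in for the block tables, and both checkpoint roots are excluded from deletion even when they fall outside the retention cutoff.

```rust
use std::collections::HashMap;

type Root = [u8; 32];

// Remove blocks older than `cutoff_slot`, keeping protected roots;
// returns how many entries were pruned.
fn prune_old_blocks(blocks: &mut HashMap<Root, u64>, cutoff_slot: u64, protected: &[Root]) -> usize {
    let before = blocks.len();
    blocks.retain(|root, slot| *slot >= cutoff_slot || protected.contains(root));
    before - blocks.len()
}

fn main() {
    let finalized_root = [1u8; 32];
    let justified_root = [2u8; 32];

    let mut blocks = HashMap::new();
    blocks.insert(finalized_root, 10); // old, but protected
    blocks.insert(justified_root, 20); // old, but protected
    blocks.insert([3u8; 32], 30);      // old and unprotected: pruned
    blocks.insert([4u8; 32], 500);     // within retention: kept

    let pruned = prune_old_blocks(&mut blocks, 100, &[finalized_root, justified_root]);
    assert_eq!(pruned, 1);
    assert!(blocks.contains_key(&finalized_root) && blocks.contains_key(&justified_root));
}
```

Protecting both roots keeps the fallback path consistent with the finalization-gated path, so a stalled chain can never prune away the checkpoints it would need to recover.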
Changes

- `crates/storage/src/store.rs`: add `else` branch in `update_checkpoints`, add 2 tests
- `crates/storage/src/lib.rs`: export `StorageReadView`, `StorageWriteBatch`, `Table`

How to test